Abstract

We  present  a  formal  specification  of  primary-backup.  We  then  prove 
lower  bounds  on  the  degree  of  replication,  failover  time,  and  worst- 
case  response  time  to  client  requests  assuming  different  failure  models. 
Finally,  we  outline  primary-backup  protocols  and  indicate  which  of 
our  lower  bounds  are  tight. 

Keywords:  Fault-tolerance,  reliability,  availability,  primary-backup,  lower 
bounds,  optimal  protocols. 

1  Introduction 

One way to implement a fault-tolerant service is by using multiple servers
that fail independently. The state of the service is replicated and distributed
among these servers, and updates are coordinated so that even when a subset
of servers fail, the service will remain available.

Such fault-tolerant services have been structured in several ways. One
approach is to replicate the service state across all servers and to present
each client's request to all nonfaulty servers in the same order. This service
architecture is commonly called active replication or the state machine
approach [Sch90] and has been widely studied from both theoretical and
practical viewpoints (e.g., [PSL80, CASD85, JB89]).

Another approach to building replicated services is to designate one
server as the primary and all the others as backups. Clients make requests
by sending messages only to the primary. If the primary fails, then a failover
occurs and one of the backups takes over. This service architecture is commonly
called the primary-backup or the primary-copy approach [AD76] and
has been widely used in commercial fault-tolerant systems. However, the
approach has not been analyzed as extensively as the state machine approach,
and little is known of the costs and tradeoffs, the degree of replication
required, and the worst-case response time for various failure models.
In this paper, we derive some of these tradeoffs. For example, some
primary-backup protocols use more servers than the number of failures to
be tolerated [LGG+91]. We are able to show that the number of servers
needed depends on the failure model.

The key difference between the active replication and primary-backup
approaches is how each masks failures. With active replication, server
failures are completely masked by voting and the service implemented is that of
a single nonfaulty server. With the primary-backup approach, a request to
the service can be lost if it is sent to a faulty primary.^1 Thus, clients can now
observe the effects of server failures. Periods during which requests are lost,
however, are bounded by the length of time that can elapse between failure
of the primary and takeover by a backup. Such behavior is an instance of
what we call a bofo service (bounded outage finitely often). Specifically, a
service outage occurs at time t if some client makes a request at that time
but never receives a response to that request.^2 Furthermore, in a (k, Δ)-bofo
service, all service outages can be grouped into at most k intervals of
time, with each interval having a length of at most Δ. Accordingly, even
though some requests may not elicit a response from a bofo service, not
too many will go unanswered. Note that if clients are restricted to send
requests only to a single server, then one cannot implement a service that is
stronger than bofo. This is because if the client sends a request to a server
and the server subsequently crashes, then the request can be lost and will
not be processed.

^1 The client can subsequently resend a copy of that request to the new primary.

^2 For simplicity, we assume in this paper that every request elicits a response.
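The (k, Δ)-bofo condition can be checked mechanically on a finite record of outage times. The following is a minimal sketch, assuming outages are given as a list of timestamps and that an interval of length Δ is closed at both ends; the function name `is_bofo` is hypothetical and not part of the paper.

```python
from typing import Iterable


def is_bofo(outage_times: Iterable[float], k: int, delta: float) -> bool:
    """True if the outage times fit in at most k closed intervals of length <= delta."""
    times = sorted(outage_times)
    used = 0
    i = 0
    while i < len(times):
        # Greedily open a new interval at the earliest outage not yet covered.
        start = times[i]
        used += 1
        if used > k:
            return False
        # Skip every outage that falls inside [start, start + delta].
        while i < len(times) and times[i] <= start + delta:
            i += 1
    return True
```

Opening each interval at the earliest uncovered outage uses the fewest intervals possible, so the check is exact for the stated assumptions.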

In  this  paper,  we  give  lower  bounds  for  implementing  a  bofo  service 
using  the  primary-backup  approach.  These  lower  bounds  depend  on  the 
message  delivery  delay  and  the  kinds  of  failures  that  can  be  tolerated.  The 
lower  bounds  constrain  the  degree  of  replication,  the  time  during  which  the 
service  can  be  without  a  primary,  and  the  worst-case  response  time  of  client 
requests. In some cases the results are surprising. For example, more than
f + 1 servers are necessary to tolerate f failures of certain types (crash and
link failures, receive-omission failures, or general-omission failures). Also, if
a majority of the servers can be faulty, then any primary-backup protocol
for receive-omission failures will have a run in which the primary is nonfaulty,
but it is forced to become a backup, while a server that is faulty
becomes the primary in its place.

Finally,  in  this  paper  we  outline  some  primary-backup  protocols.  This 
allows  us  to  determine  which  of  our  lower  bounds  are  tight. 

The paper is organized as follows. Section 2 gives a formal specification
of a primary-backup protocol. Section 3 defines our system model. Section 4
discusses the lower bounds, and in Section 5 we outline our protocols
and state which of the previously shown bounds are tight. We conclude in
Section 6.

2  Primary-Backup  Protocols 

To  derive  lower  bounds,  we  have  to  give  a  precise  definition  of  a  primary- 
backup  protocol.  We  believe  that  the  following  four  properties  characterize 
a  primary-backup  protocol  and  note  that  many  primary-backup  protocols 
(e.g., [AD76, Bar81, Cen87, BEM91]) satisfy this characterization.

Pb1: There exists a predicate Prmy_s on the state of each server s. At any
time, there is at most one server s whose state satisfies Prmy_s.^3

For brevity, whenever we say that "s is the primary (at time t)" we mean
that the state of s satisfies Prmy_s at time t. Note that the failover time for
a service is the longest period of time during which Prmy_s is not true for any s.
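Pb1 and the failover time can both be read off a record of when each server held the primary role. The sketch below assumes such a record is available as half-open tenures (server, start, end); the names `satisfies_pb1` and `longest_gap_without_primary` are hypothetical.

```python
from typing import List, Tuple

# (server, start, end): the server satisfies Prmy during [start, end).
Tenure = Tuple[str, float, float]


def satisfies_pb1(tenures: List[Tenure]) -> bool:
    """True if no two tenures overlap, i.e. there is never more than one primary."""
    ordered = sorted(tenures, key=lambda t: t[1])
    return all(prev[2] <= cur[1] for prev, cur in zip(ordered, ordered[1:]))


def longest_gap_without_primary(tenures: List[Tenure]) -> float:
    """Longest period between consecutive tenures (meaningful when Pb1 holds)."""
    ordered = sorted(tenures, key=lambda t: t[1])
    gaps = [cur[1] - prev[2] for prev, cur in zip(ordered, ordered[1:])]
    return max([0.0] + gaps)
```

For example, a primary that serves until time 5.0 followed by a backup that takes over at time 7.5 yields a gap of 2.5.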

^3 The protocol of [LGG+91] allows concurrent primaries, but only for bounded periods.
If one replaces Pb1 by this property, then except for the bounds on failover times, the
bounds shown in Section 4 continue to hold.

Pb2: Each client i maintains a server identity Dest_i such that to make a
request, client i sends a message only to Dest_i.

Property Pb2 distinguishes the primary-backup approach from active replication,
where each client sends requests to every server in the service.

For  the  next  property,  we  model  a  communications  network  by  assuming 
that  client  requests  are  enqueued  in  a  message  queue  of  a  server. 

Pb3:  If  a  request  arrives  at  a  server  that  is  not  the  primary,  then  the  request 
is  not  enqueued  (and  is  therefore  not  processed). 

Properties  Pbl-Pb3  specify  a  protocol  for  interacting  with  a  service,  but 
not  the  semantics  of  the  service.  For  example,  the  properties  do  not  rule  out 
a primary that ignores all requests. A fourth property eliminates such trivial
implementations by stipulating that the server be bofo for some values of k
and Δ.

Pb4: There exist fixed values k and Δ such that the service behaves like a
single (k, Δ)-bofo server.

This  property  is  not  implementable  if  the  number  of  failures  is  not  a  priori 
bounded.  Assuming  a  bounded  number  of  failures  is  just  a  modeling  trick. 
When  the  number  of  failures  is  unbounded,  bounding  the  rate  of  failures 
and  including  reintegration  of  recovered  servers  can  provide  service  outages 
of  bounded  lengths.  We  do  not  address  failure  rates  or  reintegration  in  this 
paper. 

A  Simple  Primary-Backup  Protocol 

As an example of a service based on the primary-backup approach, consider
the following protocol, which tolerates a single server crash. Assume that all
communication is over point-to-point nonfaulty links and that each link has
an upper bound δ on message delivery time.^4 Refer to Figure 1. There is
a primary server p_1 and a backup server p_2 connected by a communications
link. A client initially sends all requests to p_1 (indicated by the arrow labeled
1 in the figure). Whenever p_1 receives such a request, it

• processes the request and updates its state accordingly,

• sends information about the state update to p_2 (message 2 in the
figure),

• without waiting for an acknowledgement from p_2, sends a response to
the client (message 3 in the figure).

The order in which these messages are sent is important because it guarantees
that if the client receives a response, then either p_2 has received message
2 or p_2 has crashed.

Figure 1: A Simple Primary-Backup Protocol.

^4 To simplify exposition, we assume that the maximum message delay between the
clients and the servers is the same as the delay between the servers. However, our results
can be easily extended to the case when the delays are different.

Server p_2 updates its state upon receiving update messages from p_1. In
addition, p_1 sends messages to p_2 every τ seconds. If p_2 does not receive
such a message for τ + δ seconds, then p_2 becomes the primary. Once p_2
has become the primary, it informs the clients (who update their copies of
Dest) and begins processing any subsequent requests sent by them.
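The two roles in this simple protocol can be sketched as follows. This is only an illustration under stated assumptions: the service state is modeled as a key/value dictionary, message transport is replaced by an in-memory log, and the class and method names (`Primary`, `Backup`, `on_request`, `on_message`, `on_tick`) are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Primary:
    state: Dict[str, int] = field(default_factory=dict)
    sent: List[Tuple[str, str, int]] = field(default_factory=list)

    def on_request(self, key: str, value: int) -> None:
        self.state[key] = value                      # process the request
        self.sent.append(("to_backup", key, value))  # message 2: state update
        self.sent.append(("to_client", key, value))  # message 3: response, no ack awaited


@dataclass
class Backup:
    tau: float    # interval between messages from the primary
    delta: float  # upper bound on message delivery time
    last_heard: float = 0.0
    is_primary: bool = False
    state: Dict[str, int] = field(default_factory=dict)

    def on_message(self, now: float, update: Optional[Tuple[str, int]] = None) -> None:
        # Any message from the primary resets the timeout; updates also change state.
        self.last_heard = now
        if update is not None:
            key, value = update
            self.state[key] = value

    def on_tick(self, now: float) -> bool:
        # Take over after tau + delta seconds without a message from the primary.
        if now - self.last_heard >= self.tau + self.delta:
            self.is_primary = True
        return self.is_primary
```

The primary appends the update for the backup before the client response, which mirrors the ordering of messages 2 and 3 that the protocol relies on.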

We now show that this protocol satisfies our characterization of a primary-
backup protocol. Property Pb1 requires that there never be two primaries.
This is satisfied by the following definitions of Prmy:

Prmy_{p_1} ≝ (p_1 has not crashed)

Prmy_{p_2} ≝ (p_2 has not received a message from p_1 for τ + δ)

The predicate Prmy_{p_1} ∧ Prmy_{p_2} is always false in a system executing our
protocol, and hence Pb1 is satisfied. The failover time for this protocol is
the longest interval during which ¬Prmy_{p_1} ∧ ¬Prmy_{p_2} can hold, and it is
τ + 2δ seconds. Property Pb2 follows trivially from the description of the
protocol. Property Pb3 is true because requests are not sent to p_2 until
after p_1 has failed. Finally, Pb4 requires that the protocol implement a
single bofo server for some values of k and Δ. Since p_1 sends message 2
before message 3, it will never be the case that p_1 sends a response to the
client and p_2 does not get information about that response from p_1. Using
this fact, it can be shown that the service behaves like a single server. To
compute k and Δ, we can let k = 1 and so it suffices to compute the longest
interval during which a client request may not elicit a response. Assume
that p_1 crashes at time t_c. Any request sent at t_c - δ or later may be lost
since p_1 crashes at t_c. Furthermore, p_2 may not learn about p_1's crash until
t_c + τ + 2δ, and clients may not learn that p_2 is the primary for another
δ. So, the total period during which a request may not elicit a response is
t_c - δ through t_c + τ + 3δ: the service is equivalent to a single (1, τ + 4δ)-bofo
server.
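The timing bounds for the simple protocol can be restated as a short worked calculation. The sketch below only re-derives the quantities from the text; the function names are hypothetical, and `tau` and `delta` stand for τ and δ.

```python
from typing import Tuple


def failover_time(tau: float, delta: float) -> float:
    """Longest period with no primary: tau + 2*delta."""
    return tau + 2 * delta


def outage_window(t_c: float, tau: float, delta: float) -> Tuple[float, float]:
    """Period during which a request may be lost when the primary crashes at t_c."""
    start = t_c - delta            # requests still in flight when the primary crashes
    end = t_c + tau + 3 * delta    # backup detects by t_c + tau + 2*delta, clients learn delta later
    return start, end


def bofo_delta(tau: float, delta: float) -> float:
    """Length of the outage window, i.e. the bofo parameter for k = 1."""
    start, end = outage_window(0.0, tau, delta)
    return end - start
```

With τ = 1.0 and δ = 0.25, the failover time is 1.5 and the outage window has length 2.0, which equals τ + 4δ.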

3  The  Model 

We consider a system consisting of n_s servers and n_c clients. We assume
that server clocks are perfectly synchronized with real time.^5 Clients and
servers communicate by exchanging messages through a completely connected
point-to-point network. Each message sent is enqueued in a queue
maintained by the receiving process, and a process accesses its message
queue by executing receive. We assume that links between processes are
FIFO (i.e., if p_i sends message m followed by m' to process p_j, then p_j will
never receive m after m') and that if processes p_i and p_j are connected by a
nonfaulty link, then a message sent from p_i to p_j at time t will be enqueued in
p_j's queue at or before t + δ.

We  are  interested  in  identifying  the  costs  inherent  in  primary-backup 
protocols,  and  so  we  assume  that  it  takes  no  time  for  a  server  to  compute  a 
response.  We  also  assume  that  a  client  can  send  a  request  at  any  time. 

We model execution of a system by a run, which is a sequence of timestamped
events involving clients, servers, and the message queues. These
events include sending messages, enqueuing messages, receiving messages,
and modeling computation at processes. Two runs σ_1 and σ_2 of the system
are indistinguishable to a process p if the same sequence of events (with the
same timestamps) occurs at p in both σ_1 and σ_2. We assume that if two runs
σ_1 and σ_2 are indistinguishable to p, then at any time t, the state of p at
time t in σ_1 is the same as the state of p at time t in σ_2. Again, it is not hard
to extend our definition of indistinguishability to handle nondeterministic
servers; the current definition does not.

^5 Extension to the case where clocks are only approximately synchronized [LMS85] is
discussed in [Bad93].
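Indistinguishability compares only the events that occur at one process. A minimal sketch, assuming a run is represented as a list of timestamped events tagged with the process at which they occur (the `Event` type and function names are hypothetical):

```python
from typing import List, NamedTuple


class Event(NamedTuple):
    time: float
    process: str
    kind: str      # e.g. "send", "enqueue", "receive", "compute"
    payload: str


def local_view(run: List[Event], p: str) -> List[Event]:
    """The sequence of events (with timestamps) that occur at process p."""
    return [e for e in run if e.process == p]


def indistinguishable(run1: List[Event], run2: List[Event], p: str) -> bool:
    """True if p sees the same timestamped event sequence in both runs."""
    return local_view(run1, p) == local_view(run2, p)
```

Two runs may differ arbitrarily at other processes and still be indistinguishable to p, which is the property the lower-bound proofs exploit.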

We  consider  the  following  hierarchy  of  failure  models: 

Crash failures: A server may fail by halting prematurely. Until it halts, it
behaves correctly. After it halts, a timeout can detect this fact.^6

Crash+Link failures: A server may crash or a link may lose messages (but
not delay, duplicate or corrupt messages).

Receive-Omission failures: A server may fail not only by crashing, but also
by omitting to receive some of the messages sent to it over a nonfaulty
link.

Send-Omission failures: A server may fail not only by crashing, but also by
omitting to send some of the messages over a nonfaulty link.

General-Omission failures: A server may exhibit send-omission and receive-
omission failures.

Figure 2 illustrates this failure hierarchy. Note that crash+link failures
and the various types of omission failures are distinct. Although both represent
loss of messages, each is dealt with by a different masking technique.
In particular, crash+link failures can be masked by adding redundant
communication paths, while omission failures can only be masked by adding
sufficient redundant servers so that faulty processes can detect their own
failure and halt. We discuss these masking techniques in Section 5.

Henceforth, we assume that no more than f_s servers can be faulty, and
for crash+link failures that no more than f_l links can be faulty.

4  Lower  Bounds 

We now give lower bounds for implementing a single (k, Δ)-bofo server using
the primary-backup approach for each failure model.

^6 The lower bounds we derive for crash failures also hold for fail-stop failures [SS83]
except for the bound on failover time. The lower bound on failover time depends on the
maximum duration between when a server p_i fails and when failed_i becomes true.


Figure 2: Failure Hierarchy (node labels recovered from the figure: general, crash+link, crash)

4.1 Bounds on Replication

The  first  theorem  is  obvious.  However,  to  introduce  our  notation  and  the 
proof  technique  that  will  be  used  later  in  the  section,  we  give  a  formal  proof 
of  the  theorem. 

Theorem 1 Any primary-backup protocol tolerating f_s crash failures requires
n_s ≥ f_s + 1.

Proof: We prove the result by contradiction. Suppose there is a protocol
P for n_s < f_s + 1. Thus, P satisfies Pb4. Consider a run in which all n_s
servers are crashed initially and clients submit R > k⌈Δ/d⌉ requests, where
d is the minimum time between the sending of any two requests (d > 0). By
Pb4, at least one of these requests must elicit a response. This is because
the requests that cannot have responses must fall into at most k
intervals of length at most Δ, and each interval of length Δ can contain at most
⌈Δ/d⌉ requests. However, such a response is impossible since, by assumption,
all servers have crashed. □
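The counting step in this proof can be made concrete. The sketch below takes the proof's count of at most ⌈Δ/d⌉ requests per outage interval as given and computes the smallest number of requests that forces at least one response; the function names are hypothetical.

```python
import math


def max_unanswered(k: int, delta: float, d: float) -> int:
    """Upper bound used in the proof: k intervals, each holding ceil(delta / d) requests."""
    return k * math.ceil(delta / d)


def requests_forcing_response(k: int, delta: float, d: float) -> int:
    """Smallest R with R > k * ceil(delta / d)."""
    return max_unanswered(k, delta, d) + 1
```

For k = 2, Δ = 3.0 and d = 1.0, at most 6 requests can go unanswered, so submitting 7 requests forces a response from a (2, 3.0)-bofo service.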

The  following  lemma  will  be  used  for  the  rest  of  the  theorems  in  this 
section. 

Lemma 4.1 Consider any protocol that satisfies Pb4. Suppose two disjoint
and nonempty sets of servers A and B can be found that meet the following
three properties:

1. There exists a run σ_a containing R > 2k⌈Δ/d⌉ requests, where d is the
minimum time between the sending of any two client requests (d > 0).
Furthermore, in this run the servers in A do not crash and all other
servers crash at time 0.

2. There exists a run σ_b containing R requests. Furthermore, in this run
the servers in B do not crash and all other servers crash at time 0.

3. There exists a run σ_ab containing R requests. Furthermore, the servers
in A and B do not crash, σ_ab is indistinguishable from σ_a to all servers
in A, and σ_ab is indistinguishable from σ_b to all servers in B.

At least one of the above runs violates Pb2.

Proof: Suppose for contradiction that the lemma is false and runs σ_a, σ_b,
and σ_ab all satisfy Pb2.

For σ_a, by Pb4 at least R - k⌈Δ/d⌉ of the requests must have been received
by servers in A. Similarly, for σ_b at least R - k⌈Δ/d⌉ of the requests
must have been received by servers in B. Finally, since σ_ab is indistinguishable
from σ_a to servers in A, they must execute the same number of receive
events in both runs. The same holds for the servers in B. By Pb2, each
request is sent to at most one server and so at least 2(R - k⌈Δ/d⌉) requests
must have been sent in σ_ab. Since only R requests were sent, we must have
R ≥ 2(R - k⌈Δ/d⌉), or R ≤ 2k⌈Δ/d⌉, which contradicts the assumption
that R > 2k⌈Δ/d⌉.

□ 
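The inequality at the heart of the lemma can be written out step by step. The block below restates the argument in LaTeX; the shorthand N_A and N_B for the number of requests received in σ_ab by servers in A and in B is introduced here only for illustration.

```latex
\begin{align*}
N_A &\ge R - k\lceil \Delta/d \rceil
    && \text{(Pb4 in } \sigma_a \text{, indistinguishability to } A\text{)} \\
N_B &\ge R - k\lceil \Delta/d \rceil
    && \text{(Pb4 in } \sigma_b \text{, indistinguishability to } B\text{)} \\
R &\ge N_A + N_B \ge 2\bigl(R - k\lceil \Delta/d \rceil\bigr)
    && \text{(Pb2: each request goes to at most one server)} \\
R &\le 2k\lceil \Delta/d \rceil
    && \text{(contradicting } R > 2k\lceil \Delta/d \rceil\text{)}
\end{align*}
```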

Theorems 2 and 3 depend on two parameters of primary-backup protocols.
Let Γ be the maximum time between any two successive client requests
(possibly from different clients) in any run of the system, and let D be a
duration such that if some server s becomes the primary at time t_0 and remains
the primary through time t ≥ t_0 + D when a client c_i sends a request,
then Dest_i = s at time t. For simplicity of notation, we will write D < Γ to
mean that D is bounded and Γ is either unbounded or bounded and greater
than D.

With both send-omission failures and crash+link failures, messages may
fail to reach their intended destinations. The following theorem shows that
crash+link failures are more expensive to tolerate as they require more
replication.

Theorem 2 Suppose there is at most one link between any two servers
and the total number of server and link failures that can occur is f, where
f ≤ min(f_s, f_l). Then any primary-backup protocol tolerating crash+link
failures and having D < Γ requires n_s ≥ f + 2.

Proof: For contradiction, assume the existence of a protocol P with
n_s < f + 2. We will show that P has three runs σ_a, σ_b, and σ_ab that satisfy
the conditions of Lemma 4.1. From the lemma, at least one of these runs
violates Pb2, which implies that P cannot be a primary-backup protocol.

Let A be a set containing the one server s_a and let B be the set of
remaining servers. Since |A| = 1 and |B| = n_s - 1 ≤ f, A and B can be
partitioned by link failures.

We first construct the run σ_ab in which no server crashes, postulating
that the links between the servers in A and B are faulty and do not deliver
any messages. As required by Lemma 4.1, clients will send a total of R >
2k⌈Δ/d⌉ requests. Let 0 < d < Γ - D be the minimum interval between any
two such requests. We postulate that a request will be sent at time t iff no
request has been sent during the interval [t - d..t) and one of the following
rules holds.

1.  A  server  s  is  the  primary  during  the  interval  [t  -  D..t].  This  request 
arrives  immediately  and  is  enqueued  (at  s,  by  Pb3  and  the  definition 
of D).

2.  There  is  no  primary  at  time  t.  This  request  arrives  immediately  and 
by  Pb3  will  never  be  enqueued  at  any  server. 

3.  A  server  s  is  the  primary  at  time  t  but  another  server  s'  is  the  primary 
immediately  after  time  t.  If  this  request  is  sent  to  s,  then  it  arrives 
after  t,  and  if  it  is  sent  to  any  other  server,  then  it  arrives  immediately. 
In  both  cases,  it  arrives  at  a  server  that  is  not  the  primary,  and  so  will 
not  be  enqueued  (again  by  Pb3). 

Note that, by construction, the maximum interval between any two client
requests is D + d. This interval occurs when a server s becomes the primary
just before time d has elapsed since a client message was sent, and s remains
the primary for at least D. Hence, the clients will be able to send R requests
within time R(D + d). This completes the construction of σ_ab.

We now construct σ_a and σ_b, recalling that in σ_a all of the servers except
s_a crash at time 0, and in σ_b server s_a crashes at time 0. The clients send the
same requests and at the same times in σ_a and in σ_b as in σ_ab. Furthermore,
by construction these requests will arrive at the servers according to the
same rules used in constructing σ_ab. Of course, a client request may not be
delivered to the same servers in σ_a or σ_b as in σ_ab since different servers are
operational in these runs.

Since s_a does not receive any messages from servers in B in either σ_ab
or σ_a, these two runs are indistinguishable to s_a as long as it receives the
same client requests at the same times in both runs. We show that this is the
case by contradiction: let t be the earliest time that s_a can distinguish between
these two runs.

Thus, at time t either s_a received a request m in σ_ab but not in σ_a or it
received a request m in σ_a but not in σ_ab. We will assume the former; the
proof for the latter is similar. The request m must have been enqueued at
some time t' < t at s_a in σ_ab. Since m was received by s_a, m must have been
sent by rule 1. By rule 1, s_a must have been the primary through [t' - D..t']
in σ_ab and therefore, by indistinguishability, in σ_a as well. By the definition
of D, m would have been enqueued at s_a at time t' in σ_a as well.

Since s_a cannot distinguish between the runs before t, s_a cannot receive
m before t in σ_a, and s_a must execute a receive in both σ_a and σ_ab at time
t. So, it must be the case that s_a receives another request m' ≠ m at time t
in σ_a. Assume that m' was enqueued at time t''. By an indistinguishability
argument similar to above, m' must be enqueued at time t'' at s_a in σ_ab as
well. Therefore, if s_a received m' in σ_a at time t, it must receive m' in σ_ab as
well, a contradiction.

A similar argument can be used to show that the servers in B receive
the same requests in σ_b and σ_ab, and so these two runs are indistinguishable
to the servers in B. Thus, by Lemma 4.1 P cannot be a primary-backup
protocol. □

The  next  theorem  states  that  additional  replication  is  required  in  order  to 
tolerate  receive-omission  failures.  The  proof  is  similar  to  that  of  Theorem  2, 
and  so  it  is  omitted. 

Theorem 3 Any primary-backup protocol tolerating receive-omission failures
and having D < Γ requires n_s > .

The next lower bound holds independent of the relation between D and
Γ. However, before we prove the result, we need the following definitions.

Define ≺ to be the potential causality relation [Lam78] on server events
e_1 and e_2 as follows: e_1 ≺ e_2 iff

1. Both e_1 and e_2 occur at the same server s and e_1 occurs before e_2, or

2. e_1 is a send event and e_2 is the corresponding receive event, or

3. (∃e: e_1 ≺ e ∧ e ≺ e_2)
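The three clauses of this definition translate directly into a transitive-closure computation. The sketch below assumes events are identified by strings, that each server's events are given in local order, and that matching send/receive pairs are listed explicitly; the function name `potential_causality` is hypothetical.

```python
from itertools import product
from typing import Dict, List, Set, Tuple


def potential_causality(
    local_order: Dict[str, List[str]],
    messages: List[Tuple[str, str]],
) -> Set[Tuple[str, str]]:
    """Return all pairs (e1, e2) such that e1 potentially causes e2."""
    edges: Set[Tuple[str, str]] = set()
    # Clause 1: events at the same server, ordered by occurrence.
    for events in local_order.values():
        for i in range(len(events)):
            for j in range(i + 1, len(events)):
                edges.add((events[i], events[j]))
    # Clause 2: a send event precedes its corresponding receive event.
    edges.update(messages)
    # Clause 3: close the relation under transitivity.
    changed = True
    while changed:
        changed = False
        for (a, b), (c, d) in product(list(edges), repeat=2):
            if b == c and (a, d) not in edges:
                edges.add((a, d))
                changed = True
    return edges
```

This quadratic closure is meant for small illustrative runs only; it is not tuned for long event histories.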

We say that a request m is an update request iff in any run σ for which m
has a response r, any other response r' sent after r in real time causally
follows m, i.e., if event e(m) corresponds to the receipt of m and event
e(r') corresponds to the sending of r', then e(m) ≺ e(r'). A primary-backup
protocol is trivial to implement if there are no update requests, and so we
assume that update requests exist and that clients can send them at any
time.

Theorem 4 Any primary-backup protocol tolerating general-omission failures
requires n_s > 2f_s.

Proof: Assume for contradiction that there is a protocol for n_s ≤ 2f_s.
Partition the servers into two disjoint sets A and B of size at most f_s each.
We will construct two runs σ_1 and σ_2. In each run, one set of servers will
be faulty and the other set will be nonfaulty.

σ_1: The servers in A are faulty and fail to communicate with all servers
in B, but behave correctly otherwise. Clients send update requests until
the first response is sent (this must happen, by Pb4). Assume that the first
response r to a request m is sent at time t. Say that this response is sent by
server s.

σ_2: The same as σ_1 up to time t, but if s is in B, then in σ_2 the servers
in B are faulty and fail to communicate with all servers in A. In either case,
no server can distinguish σ_1 from σ_2 through time t and therefore, the first
response r is sent at time t in σ_2 as well.

By construction, r is sent by a faulty server in σ_2. Let all of the faulty
servers in σ_2 crash immediately after r is sent and have clients continue to
send requests until another response r' is sent. This response must have
been sent by a nonfaulty server, which implies that ¬(e(m) ≺ e(r')). However,
this violates the fact that m is an update request. □
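The proof depends on being able to split the servers into two groups that could each be entirely faulty. A minimal sketch of that precondition, assuming servers are given as a list of names (the function name `split_into_faulty_candidates` is hypothetical):

```python
from typing import List, Optional, Tuple


def split_into_faulty_candidates(
    servers: List[str], f_s: int
) -> Optional[Tuple[List[str], List[str]]]:
    """Split servers into disjoint sets A and B of size at most f_s each.

    Returns None when no such split exists, i.e. when n_s > 2 * f_s.
    """
    n_s = len(servers)
    if n_s > 2 * f_s:
        return None
    half = (n_s + 1) // 2  # ceil(n_s / 2) <= f_s whenever n_s <= 2 * f_s
    return servers[:half], servers[half:]
```

When the function returns None, the partition argument cannot be applied, which is consistent with the bound n_s > 2f_s stated in the theorem.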


4.2  Bounds  on  Blocking 

Informally, a blocking primary-backup protocol is one in which the primary
must, subsequent to receiving a request m, either receive a message from
another server or simply wait an interval before it can respond to m. We
say that a primary-backup protocol is C-blocking if, whenever a request
(received, say, at t_m) elicits a response in a failure-free run at time t_r, then
t_r - t_m ≤ C. For example, any primary-backup protocol in which the primary sends
information about a request to the backups and waits for acknowledgement
before sending the response to the client will be at least 2δ-blocking.
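The C-blocking definition is a bound on the gap between receipt and response in failure-free runs. The sketch below assumes a failure-free run is summarized as pairs (t_m, t_r); the function names are hypothetical, and `ack_wait_response_time` models only the worst case in which both the update and its acknowledgement take the full δ.

```python
from typing import Iterable, Tuple


def is_c_blocking(latencies: Iterable[Tuple[float, float]], c: float) -> bool:
    """True if every (t_m, t_r) pair satisfies t_r - t_m <= c."""
    return all(t_r - t_m <= c for t_m, t_r in latencies)


def ack_wait_response_time(t_m: float, delta: float) -> float:
    """Worst-case response time when the primary waits for a backup's acknowledgement."""
    update_arrives = t_m + delta          # update reaches the backup
    ack_arrives = update_arrives + delta  # acknowledgement returns to the primary
    return ack_arrives
```

A request received at time 10.0 with δ = 0.5 is answered at 11.0 in this worst case, so such a run is consistent with 1.0-blocking but not with any smaller C.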

As shown in Section 5, 0-blocking primary-backup protocols are possible
for crash and crash+link failure models. The simple protocol tolerating
crash failures presented in Section 2 is 0-blocking. We call such protocols
nonblocking because the primary can send a reply to the client as soon as
the reply is computed. Nonblocking protocols tolerating receive-omission
failures are also possible as long as n_s > 2f_s, but there is no nonblocking
primary-backup protocol tolerating send-omission failures.

Theorem 5 Any primary-backup protocol tolerating receive-omission failures
with f_s > 1, n_s ≤ 2f_s and D < Γ is C-blocking for some C ≥ 2δ.


Proof: For contradiction, suppose there is a primary-backup protocol
for n_s ≤ 2f_s and f_s > 1 that is C-blocking where C < 2δ. Partition the
servers into two sets A and B where |A| = f_s and |B| = n_s - f_s ≤ f_s. We
construct three runs. In all three runs, assume that all server messages take
δ to arrive.

σ_1: There are no failures and all client requests take δ to arrive. Moreover,
clients send update requests until some request m evokes a response
r. Let m be received at time t_m by server p ∈ A and r be sent at time t_r
by a different server q ∈ A. Notice that since the protocol is C-blocking
where C < 2δ, t_r - t_m < 2δ. Also, since by construction all requests take
δ to arrive, all client requests sent after time t_m + δ will be received after
time t_r.

σ_2: Identical to σ_1 until p receives m at time t_m. At this point in σ_2, all
servers in A are assumed to crash and clients are assumed to send no request
during the interval (t_m + δ..t_r]. Finally, after time t_r clients are assumed to
repeatedly send requests at intervals of at least d where 0 < d < Γ - D as
follows. A request is sent at time t if no request has been sent in [t - d..t)
and one of the following rules holds.

1. A server s ∈ B is the primary during the interval [t - D..t]. This
request arrives immediately and is enqueued (at s, by Pb3 and the
definition of D).

2. There is no primary in B at time t. This request arrives immediately
and by Pb3 will never be enqueued at any server.

3. A server s ∈ B is the primary at time t but another server s' ∈ B
is the primary immediately after time t. If the request is sent to s,
then it arrives after t, and if it is sent to any other server it arrives
immediately. In both cases, it arrives at a server that is not the primary,
and so will not be enqueued (again, by Pb3).

Notice that eventually, there will be a response (say r') in σ_2 because
the protocol satisfies Pb4, and by construction it must be from a request
sent by rule 1.

<73:  The  same  as  <73 1  except  that  the  servers  in  .4  do  not  crash  at  time 
tm-  Instead,  the  servers  in  B  commit  receive  failures  on  all  messages  sent 
after  tm  by  servers  in  A.  Clients  send  requests  at  the  same  times  as  in  ctj 
which  arrive  using  the  same  rules  as  <73. 

Now,  consider  these  three  runs.  By  construction,  the  runs  are  identical 
up  to  time  t.  Since  ail  server  messages  tadce  6  to  arrive,  clients  cannot  dis¬ 
tinguish  <Ti  and  <T3  through  tm  +  6,  and  so  clients  send  the  same  requests  to 
the  same  servers  in  both  <Ti  and  <73.  Similarly,  since  all  server  messages  take 
6  to  arrive,  the  servers  in  B  cannot  distinguish  between  Ui  and  (T3  through 
tm  +  Therefore,  since  U  -  tm  <  26,  p  (the  server  that  received  request 
m  at  time  tm  in  <7j)  and  q  (the  server  that  sent  response  r  at  time  U  in 
<7i)  cannot  distinguish  between  <7i  and  <73  through  time  tr,  and  so  q  sends 
response  r  in  <73  as  well.  Then,  using  an  ugument  similar  to  the  one  in 
Theorem  2,  servers  in  B  cannot  distinguish  <72  and  £73,  auid  so  response  r' 
also  occurs  in  <73.  However,  -<(e(m)  <  e(r'))  which  violates  the  assumption 
that  m  is  an  update  request.  □ 
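
The two timing facts used in run σ_1 can be written out as a short worked step. This is only a sketch of the arithmetic; the symbol t_send (the time at which some later client request is sent) is introduced here for illustration and does not appear in the proof.

```latex
\begin{align*}
t_r - t_m &< 2\delta
  && \text{(the protocol is $C$-blocking with $C < 2\delta$)}\\
t_{\mathrm{send}} > t_m + \delta
  &\;\Longrightarrow\; t_{\mathrm{send}} + \delta > t_m + 2\delta > t_r
  && \text{(client requests take $\delta$ to arrive)}
\end{align*}
```

So any request sent after t_m + δ is received only after the response r has already been sent.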

In run σ_3 of the above proof, a correct primary (p in set A) becomes
the backup, while a faulty server from set B becomes the primary in p's
place. It is always possible to construct such a run. This is a disconcerting
property: there does not exist a primary-backup protocol that tolerates
receive-omission failures with n_s ≤ 2f_s in which a primary cedes only when
it fails. Moreover, this lower bound is tight: we have constructed a
receive-omission primary-backup protocol with n_s = 2f_s + 1 in which a primary
cedes only when it fails.

The above lower bound holds only if f_s > 1. If f_s = 1, then the following
theorem holds. Its proof is similar to the proof of Theorem 5, except that
p = q.

Theorem 6 Any primary-backup protocol tolerating receive-omission failures
with f_s = 1 and n_s ≤ 2f_s and having D < Γ is C-blocking for some
C ≥ δ.

Primary-backup  protocols  tolerating  send-omission  failures  exhibit  the 
same  blocking  as  those  tolerating  receive-omission  failures: 

Theorem 7 Any primary-backup protocol tolerating send-omission failures
and f_s > 1 is C-blocking for some C ≥ 2δ.

Proof: For contradiction, suppose there is a primary-backup protocol
that is C-blocking where C < 2δ. We consider the following two runs in
which all server messages take δ to arrive.

σ_1: There are no failures and all client requests take δ to arrive.
Moreover, clients send update requests until some request m evokes a response r.
Let m be received at time t_m by server p and r be sent at time t_r by a
different server q. Notice that since the protocol is C-blocking where C < 2δ,
t_r - t_m < 2δ. Also, since by construction all requests take δ to arrive, all
client requests sent after time t_m + δ will be received after time t_r.

σ_2: Identical to σ_1 through t_m. After t_m, p and q fail and omit to send
all messages to all servers except each other. Since by construction all
messages take δ to arrive, servers and clients cannot distinguish between σ_1 and
σ_2 through t_m + δ, and as a result p and q cannot distinguish the two runs
through t_m + 2δ. Therefore, since t_r - t_m < 2δ, q sends the response r at
time t_r in σ_2 as well. Now let p and q crash at time t_r and the clients send
requests after time t_r. By Pb4, there eventually must be some request m'
that results in a response r'. However, ¬(e(m) ≺ e(r')), which violates the
assumption that m is an update request. □


Theorem 8 Any primary-backup protocol tolerating send-omission failures
and f_s = 1 is C-blocking for some C ≥ δ.

4.3  Bounds  on  Failover  Times 

The failover time is the longest interval during which Prmy_s is not true for
any server s. In this section, we present lower bounds on failover times.
In order to discuss these bounds, we postulate a fifth property of
primary-backup protocols:

Pb5: A server that is the primary remains so until there is a failure.

This is a reasonable expectation and it is valid for all protocols that we have
found in the literature.

Theorem 9 Any primary-backup protocol tolerating f_s crash failures must
have a failover time of at least f_s δ.

Proof: Assume that the theorem is false. We derive a contradiction by
induction on f_s.

Base case f_s = 0: trivially true since the failover time cannot be
smaller than zero.

Induction case f_s > 0: suppose the theorem holds for at most f_s - 1
failures, but there is a protocol P for which the theorem is false when there
are f_s failures. From the induction hypothesis, there is a run σ with at most
f_s - 1 failures and an interval [t_0..t_1] of length at least (f_s - 1)δ during
which there is no primary. Let p_1 be the server that becomes the primary at
t_1. Consider the two runs σ_1 and σ_2 that extend σ as follows:

σ_1: Assume p_1 crashes at time t_1. By assumption, there exists a new
primary (say p_2) at time t_2 < t_1 + δ. Since p_1 crashes at time t_1, p_2 does
not receive any messages from p_1 that were sent after time t_1.

σ_2: Assume p_1 does not crash but all messages sent by p_1 after time t_1
take δ to arrive.

Since p_2 cannot distinguish σ_1 from σ_2 through time t_2, p_2 becomes the
primary at time t_2 in σ_2. By Pb5, however, p_1 remains the primary at time
t_2 in σ_2. This violates Pb1, and so P is not a primary-backup protocol. □
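
The bound t_2 < t_1 + δ used in run σ_1 can be unpacked as a short worked step. This is a sketch under one stated assumption: in σ_1 no server acts as primary from t_0 until t_2 (p_1 crashes at the instant t_1 at which it would take over), so the whole interval [t_0..t_2] counts toward the failover time of P.

```latex
\begin{align*}
t_1 - t_0 &\ge (f_s - 1)\,\delta
  && \text{(induction hypothesis)}\\
t_2 - t_0 &< f_s\,\delta
  && \text{(the theorem is assumed false for $P$)}\\
t_2 - t_1 &= (t_2 - t_0) - (t_1 - t_0)
  < f_s\,\delta - (f_s - 1)\,\delta = \delta .
\end{align*}
```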

The  failover  times  for  all  other  failure  models  have  a  larger  lower  bound. 

Theorem 10 Any primary-backup protocol tolerating f crash+link failures,
where f ≤ min(f_s, f_l), has a failover time of at least 2fδ.

Proof: We again assume that the theorem is false and derive a
contradiction.

Base case f = 0: trivially true.

Induction case f > 0: suppose the theorem holds for at most f - 1
failures, but there is a protocol P for which the theorem is false when there
are f failures.

From the induction hypothesis, there is a run σ with at most f - 1
failures and an interval [t_0..t_1] of length at least 2(f - 1)δ during which
there is no primary. Let p_1 be the server that becomes the primary at t_1.
Consider the three runs σ_1, σ_2 and σ_3 that extend σ as follows:

σ_1: Assume that p_1 crashes at time t_1 and that all messages sent after t_1
take δ to arrive. By assumption, there exists a new primary (say p_2) at time
t_2 < t_1 + 2δ. Since p_1 crashes at time t_1, p_2 does not receive any messages
from p_1 that were sent after time t_1. Furthermore, since all messages take δ
to arrive, any message that was sent after t_1 + δ can be received by p_2 only
after time t_2.

σ_2: Assume that p_1 does not crash and that all messages sent after time
t_1 take δ to arrive. Since there are no failures after time t_1, by Pb5 p_1
continues to be the primary through time t_2.

σ_3: The same as σ_2 except that the link between p_1 and p_2 is faulty and
does not deliver any message sent by p_1 to p_2 after time t_1.

By construction, p_2 cannot distinguish σ_1 from σ_3 through time t_2, and
so p_2 becomes the primary at time t_2 in σ_3. Similarly, p_1 cannot distinguish
σ_2 from σ_3 through time t_2 and so p_1 remains the primary until time t_2 in
σ_3. This violates Pb1, and so P is not a primary-backup protocol. □

We  omit  the  proofs  of  the  following  two  theorems  because  they  are  similar 
to  Theorem  9. 

Theorem 11 Any primary-backup protocol tolerating f_s receive-omission
failures has a failover time of at least 2f_s δ.

Theorem 12 Any primary-backup protocol tolerating f_s send-omission
failures has a failover time of at least 2f_s δ.
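
The four failover bounds of Theorems 9 through 12 differ only in a constant factor, which the following minimal sketch tabulates. The names `FailureModel` and `failover_lower_bound` are hypothetical and introduced only for illustration; the parameter `f` stands for the number of failures tolerated in the respective theorem, and the bounds are valid only under Pb5.

```python
from enum import Enum


class FailureModel(Enum):
    CRASH = "crash"
    CRASH_LINK = "crash+link"
    RECEIVE_OMISSION = "receive-omission"
    SEND_OMISSION = "send-omission"


def failover_lower_bound(model: FailureModel, f: int, delta: float) -> float:
    """Lower bound on failover time as stated in Theorems 9-12 (assuming Pb5)."""
    if f < 0 or delta < 0:
        raise ValueError("f and delta must be non-negative")
    # Theorem 9 gives f * delta for crash failures;
    # Theorems 10-12 give 2 * f * delta for the other three models.
    factor = 1 if model is FailureModel.CRASH else 2
    return factor * f * delta
```

For example, with two tolerated failures and δ = 0.5 time units, the crash model gives a bound of 1.0 and each of the other models gives 2.0.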

5  Outline  of  the  Protocols 

In order to establish that the bounds given above are tight, we have
developed a set of primary-backup protocols for the different failure
models [BMST92]. In this section, we outline these protocols and use them to
show which of the lower bounds in the previous sections are tight.

Our protocol for crash failures is similar to the protocol given in
Section 2. Whenever the primary receives a request from the client, it processes
that request and sends information about state updates to the backups
before sending a response to the client. All servers periodically send messages
to each other in order to identify server failures. This protocol uses (f_s + 1)
servers and is 0-blocking. Thus, Theorem 1 is tight and this protocol uses
the optimal number of servers and incurs no additional delay. Furthermore,
this protocol has the failover time f_s δ + τ for arbitrarily small and positive
τ, and so Theorem 9 is tight.
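
The order of operations in the crash-failure protocol (process the request, send state updates to the backups, then respond) can be illustrated with a toy in-memory model. This is a sketch under strong simplifying assumptions, not the protocol of [BMST92]: messages are delivered instantly, failure detection is replaced by a `crashed` flag, and the names `Server`, `Service` and `handle_update` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Sequence


@dataclass
class Server:
    name: str
    state: dict = field(default_factory=dict)
    crashed: bool = False

    def apply(self, key: str, value: str) -> None:
        self.state[key] = value


class Service:
    """Toy model: the first non-crashed server in the list acts as primary."""

    def __init__(self, names: Sequence[str]) -> None:
        self.servers = [Server(n) for n in names]

    def primary(self) -> Server:
        # Failure detection (the periodic messages in the text) is abstracted
        # away: we simply skip servers marked as crashed.
        for s in self.servers:
            if not s.crashed:
                return s
        raise RuntimeError("all servers have crashed")

    def handle_update(self, key: str, value: str) -> str:
        p = self.primary()
        p.apply(key, value)                 # 1. primary processes the request
        for b in self.servers:
            if b is not p and not b.crashed:
                b.apply(key, value)         # 2. state update goes to each backup
        return f"ok from {p.name}"          # 3. only then is the response sent
```

With two servers (f_s = 1), an update handled by the first server is already present at the second when the response is returned, so the second server can take over with the same state if the first crashes.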

In order for the protocol to tolerate crash+link failures, we add an
additional server. By Theorem 2, this server is necessary. The additional server
ensures that there is always at least one nonfaulty path between any two
correct servers, where a path contains zero or more intermediate servers.
The crash failure protocol outlined above is now modified so that a primary
ensures any message sent to a backup is sent across at least one nonfaulty
path. Note that this protocol uses (f + 2) servers and is 0-blocking. Thus,
Theorem 2 is tight and this protocol uses the optimal number of servers and
incurs no additional delay. Furthermore, this protocol has the failover time
2fδ + τ for arbitrarily small and positive τ, and so Theorem 10 is tight.
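
The role of the additional server can be checked on a small example: with one faulty link, two servers are cut off from each other, while three servers still have a path through the third. The sketch below assumes a fully connected set of servers with one bidirectional link per pair; the function names are hypothetical and the search is an ordinary breadth-first search, not part of the protocol itself.

```python
from collections import deque
from itertools import combinations


def has_nonfaulty_path(servers, crashed, faulty_links, src, dst):
    """Breadth-first search from src to dst over correct servers and good links."""
    bad = {frozenset(link) for link in faulty_links}
    seen, queue = {src}, deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            return True
        for v in servers:
            # Skip visited servers, crashed servers, and faulty links.
            if v in seen or v in crashed or frozenset((u, v)) in bad:
                continue
            seen.add(v)
            queue.append(v)
    return False


def all_correct_pairs_connected(servers, crashed, faulty_links):
    """True if every pair of correct servers is joined by a nonfaulty path."""
    correct = [s for s in servers if s not in crashed]
    return all(
        has_nonfaulty_path(servers, crashed, faulty_links, a, b)
        for a, b in combinations(correct, 2)
    )
```

For servers a, b and c with the link between a and b faulty, the path from a to b through c remains; with only a and b, the same link failure disconnects them.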

Most of our protocols for the different kinds of omission failures apply
translation techniques [NT88] to the crash failure protocol. These techniques
ensure that a faulty server detects its own failure and halts. The translations
assume a round-based protocol. Since our crash failure protocol is not
round-based, we must modify the translations so that a server can send and receive
messages at any time rather than just at the beginning or the end of a
round. This is not difficult to do. All of these resulting omission protocols
have failover time 2f_s δ + τ, and thus Theorems 11 and 12 are tight. The
protocol for send-omission failures uses f_s + 1 servers and is (2δ + τ)-blocking.
Furthermore, we also have a send-omission protocol for f_s = 1 that is
δ-blocking. Thus, Theorems 7, 8 and 12 are tight. The protocol for
general-omission failures also uses 2f_s + 1 servers and is (2δ + τ)-blocking, and so
Theorem 4 is tight, and Theorems 7 and 12 are tight for general-omission
failures as well.

We have not been able to determine whether Theorems 3 and 5 are
tight. Our protocol tolerating receive-omission failures uses 2f_s + 1 servers,
whereas the lower bound on n_s in Theorem 3 is smaller. We have
constructed receive-omission protocols for n_s = 2, f_s = 1 and n_s = 4, f_s = 2
but have not been able to generalize the protocols. The protocols in this
region have the odd property that a nonfaulty primary can cede to a faulty
primary, and so we do not expect such protocols to have much practical
importance. However, the protocol for n_s = 2, f_s = 1 is δ-blocking and so
Theorem 6 is tight.

Table 1 summarizes all of our results.

[Table 1: Lower Bounds. Footnotes: "Bound not known to be tight"; "D < Γ".]

6  Discussion 

This paper gives a formal characterization of primary-backup protocols for a
synchronous  system.  It  presents  lower  bounds  on  the  degree  of  replication, 
the  blocking  time,  and  the  failover  time  for  a  primary-backup  protocol  under 
various  kinds  of  server  and  link  failures.  A  set  of  primary-backup  protocols 
is  outlined  and  used  to  show  which  of  our  lower  bounds  are  tight. 

It is instructive to compare our results to existing primary-backup
protocols. A two-server primary-backup protocol that tolerates crash+link
failures is presented in [Bar81], which seemingly contradicts Theorem 2.
However, this protocol assumes that there are two links between the two
servers, which effectively masks a single link failure. Hence, only crash
failures need to be tolerated, which can be accomplished using only two servers
(Theorem 1).

A more ambitious primary-backup protocol is presented in [LGG+91].
This protocol tolerates the following failure model (quoted from [LGG+91]):


The  network  may  lose  or  duplicate  messages,  or  deliver  them  late 
or  out  of  order;  in  addition  it  may  partition  so  that  some  nodes 
are temporarily unable to send messages to some other nodes. As
is  usual  in  distributed  systems,  we  assume  the  nodes  are  fail-stop 
processors  and  the  network  delivers  only  uncorrupted  messages. 

This failure model is incomparable with the hierarchy we present. However,
the protocol does tolerate general-omission failures and has optimal degree
of replication as it uses n_s = 2f_s + 1 servers.

In Theorem 2, we assumed that D < Γ. This assumption is crucial:
we have constructed a two-server primary-backup protocol tolerating one
crash+link failure for which D > Γ. Recall that link failures are masked
by adding redundant paths between the servers. Our two-server crash+link
protocol essentially uses the path from the primary to the backup through
the client as the redundant path. Thus, there appears to be a tradeoff
between the degree of replication and the time it takes for a client to learn
that there is a new primary.

The lower bounds on failover times given in Section 4.3 were derived
assuming Pb5. We have constructed primary-backup protocols that have
failover times smaller than the lower bounds given in Section 4.3, and as
expected these protocols do not satisfy Pb5. This smaller failover time is
achieved at a cost of an increased variance in service response time.

Finally, we have attempted to give a characterization of primary-backup
that is broad enough to include most synchronous protocols that are
considered to be instances of the approach. There are protocols, however, that
are incomparable to the class of protocols we analyze [BJ87]. In addition,
the protocols in [OL88, MHS89] are incomparable since they were
developed for an asynchronous setting. Such protocols cannot be cast in terms
of implementing a (k, Δ)-bofo service for finite values of k and Δ. We are
currently studying possible characterizations for a primary-backup protocol
in an asynchronous system and expect to extend our results to this setting.


References 

[AD76] P.A. Alsberg and J.D. Day. A Principle for Resilient Sharing
of Distributed Resources. In Proceedings of the Second International
Conference on Software Engineering, pages 627-644, October 1976.


[Bar81] J.F. Bartlett. A NonStop Kernel. In Proceedings of the Eighth
ACM Symposium on Operating System Principles, SIGOPS Operating
System Review, volume 15, pages 22-29, December 1981.

[BEM91] Anupam Bhide, E.N. Elnozahy, and Stephen P. Morgan. A
Highly Available Network File Server. In USENIX, pages 199-205,
1991.

[BJ87] Kenneth P. Birman and Thomas A. Joseph. Exploiting Virtual
Synchrony in Distributed Systems. In Eleventh ACM Symposium
on Operating System Principles, pages 123-138, November 1987.

[BMST92] Navin Budhiraja, Keith Marzullo, Fred Schneider, and Sam
Toueg. Optimal primary-backup protocols. Technical report,
Cornell University, Ithaca, N.Y., 1992.

[Bud93] Navin Budhiraja. Primary Backup in Synchronous and
Asynchronous Systems. PhD thesis, Cornell University, Department
of Computer Science, 1993. In preparation.

[CASD85] Flaviu Cristian, Houtan Aghili, H. Ray Strong, and Danny
Dolev. Atomic broadcast: From simple message diffusion to
Byzantine agreement. In Proceedings of the Fifteenth International
Symposium on Fault-Tolerant Computing, pages 200-206,
Ann Arbor, Michigan, June 1985. A revised version appears as
IBM Technical Report RJ5244.

[Cen87] IBM International Technical Support Centers. IMS/VS Extended
Recovery Facility (XRF) Technical Reference. Technical
Report GG24-3153-0, IBM, 1987.

[JB89] Thomas Joseph and Kenneth Birman. Reliable Broadcast Protocols,
pages 294-318. ACM Press, New York, 1989.

[Lam78] Leslie Lamport. Time, Clocks, and the Ordering of Events in
a Distributed System. Communications of the ACM, 21(7):558-565,
July 1978.

[LGG+91] Barbara Liskov, Sanjay Ghemawat, Robert Gruber, Paul Johnson,
and Michael Williams. Replication in the Harp file system.
In Proceedings of the 13th Symposium on Operating System Principles,
pages 226-238, 1991.


[LMS85] Leslie Lamport and P. M. Melliar-Smith. Synchronizing clocks in
the presence of faults. Journal of the ACM, 32(1):52-78, January
1985.

[MHS89] Timothy Mann, Andy Hisgen, and Garret Swart. An Algorithm
for Data Replication. Technical Report 46, Digital Systems Research
Center, 1989.

[NT88] Gil Neiger and Sam Toueg. Automatically increasing the
fault-tolerance of distributed systems. In Proceedings of the Seventh
ACM Symposium on Principles of Distributed Computing,
pages 248-262, Toronto, Ontario, August 1988. ACM SIGOPS-SIGACT.

[OL88] B. Oki and Barbara Liskov. Viewstamped replication: A new
primary copy method to support highly available distributed systems.
In Seventh ACM Symposium on Principles of Distributed
Computing, pages 8-17, August 1988.

[PSL80] M. Pease, R. Shostak, and Leslie Lamport. Reaching agreement
in the presence of faults. Journal of the ACM, 27(2):228-234,
April 1980.

[Sch90] Fred B. Schneider. The state machine approach: A tutorial.
Computing Surveys, 22(4):299-319, December 1990.

[SS83] Richard D. Schlichting and Fred B. Schneider. Fail-stop processors:
an approach to designing fault-tolerant computing systems.
ACM Transactions on Computer Systems, 1(3):222-238, August
1983.


